Skip to content

Fix potential deadlock in TimeSource::destroy_clock_sub (#2962)#3167

Open
PavelGuzenfeld wants to merge 2 commits into
ros2:rollingfrom
PavelGuzenfeld:fix/time-source-deadlock-2962
Open

Fix potential deadlock in TimeSource::destroy_clock_sub (#2962)#3167
PavelGuzenfeld wants to merge 2 commits into
ros2:rollingfrom
PavelGuzenfeld:fix/time-source-deadlock-2962

Conversation

@PavelGuzenfeld
Copy link
Copy Markdown

@PavelGuzenfeld PavelGuzenfeld commented Jun 6, 2026

Fixes #2962

destroy_clock_sub() held clock_sub_lock_ across clock_executor_thread_.join(). If the executor thread's shutdown path needs clock_sub_lock_, the two deadlock: the main thread waits on the join while the executor thread waits on the lock.

This moves the thread, executor, and callback group into locals under the lock, releases the lock, then joins outside the critical section. The subscription is kept alive until after the join, so the executor thread can finish an in-flight callback without touching a destroyed subscription, and the thread is always joined before the function returns.

Order: cancel → release lock → join → remove_callback_group → reacquire lock → reset subscription.

Adds test_time_source_deadlock.cpp: rapid construct/destroy with use_sim_time=true, concurrent construct/destroy across threads, toggling use_sim_time while spinning, and a check for no lingering thread. If the deadlock is present these hang and trip the 60 s timeout.


Some of this change was produced with Claude Opus 4.6 (Anthropic).

destroy_clock_sub() held clock_sub_lock_ while calling
clock_executor_thread_.join(). If the executor thread's shutdown path
contends on clock_sub_lock_, this produces a deadlock.

Fix: move the thread into a local variable under the lock, release the
lock, then join outside the critical section. The thread is not dangling
— it is unconditionally joined before the function returns.

Key improvement over the naive move approach: the subscription is kept
alive until after join completes, ensuring the executor thread can
finish any in-flight callback without accessing a destroyed subscription.

Cleanup order: cancel → (release lock) → join → remove_callback_group
→ (reacquire lock) → reset subscription.

Fixes ros2#2962

Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
Generated-by: Claude Opus 4.6 (Anthropic)
Tests reproduce the conditions from ros2#2962:
- Rapid construct/destroy of nodes with use_sim_time=true (50 cycles)
- Concurrent construct/destroy from 4 threads
- Toggle use_sim_time while spinning
- Verify no lingering thread after destruction

If the deadlock is present, these tests will hang and be caught by the
60-second timeout.

Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
Generated-by: Claude Opus 4.6 (Anthropic)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deadlock in destroy_clock_sub waiting for clock_executor_thread_ to join

1 participant